Abstract: In this study we present a kernel-based convolution
model to characterize neural responses to natural sounds by
decoding their time-varying acoustic features. The model allows
natural sounds to be decoded from high-dimensional neural
recordings, such as magnetoencephalography (MEG), that track the
timing and location of human cortical signalling noninvasively
across multiple channels. We used MEG responses recorded from
subjects listening to acoustically different environmental
sounds. By decoding the stimulus frequencies from the responses,
our model was able to distinguish between two different sounds
that it had never encountered before with 70% accuracy.
Convolution models typically decode the frequencies that appear
at a certain time point in the sound signal by using neural
responses from that time point up to a certain fixed duration of
the response. Using our model, we evaluated several fixed
durations (time-lags) of the neural responses and observed
auditory MEG responses to be most sensitive to the spectral
content of the sounds at time-lags of 250 ms to 500 ms. The
proposed model should be useful for determining what aspects of
natural sounds are represented by high-dimensional neural
responses and may reveal novel properties of neural signals.

I. Introduction

The way our brain represents periodic signals in different
sensory modalities has been the subject of several studies.
For example, the spiking of movement-sensitive neurons in
response to periodic signals was successfully encoded using
the convolution model [1], which is a linear mapping from
time-varying neural responses to a time-varying representation
of the incoming stimuli. The model has subsequently been
employed in many studies, e.g., to investigate how primary
auditory cortex neurons encode spectro-temporal features in
invasive recordings of ferrets [?] and humans [?], to study
the robustness and the extent to which perceptual aspects are
coded in the cortical representation [?], and to characterize
the stimulus-response function of auditory neurons [?].

Earlier studies addressing spectro-temporal encoding in the
human auditory system have typically used invasive
intracortical recordings with limited spatial coverage. To
study the spatio-temporal response across the entire cortex,
one can use MEG, which tracks the timing and location of
cortical responses at high resolution. However, direct
application of the convolution model to MEG data is
computationally challenging, as the complexity of the model
is directly proportional to the spatial dimensionality of the
neural response data, which is usually very high in MEG. In
this study we propose the kernel convolution model, which is
a dual representation of a sparse convolution model and has
an efficient parameter estimation scheme that is independent
of the spatial dimensionality of the neural responses. We first
show that the presented methodology, using time-varying
acoustic features of sound stimuli (here, the spectrogram), is
able to decode new sounds with high accuracy. We then evaluate
different time-lags of the MEG responses in decoding the
spectrogram of test sounds in a cross-validation setting.

Figure 1. Spectrograms and Fourier transforms of four sample sounds.
The amplitude waveform of the sound is depicted below each spectrogram.

II. Convolution based predictive modelling 
A. Convolution model 

The convolution model [?]-[?] is a linear mapping between
the response of a population of neurons and a time-varying
representation of the original stimulus, here the spectrogram
$s(t, f)$, sampled at times $t = 1, \ldots, T$ (see Fig. 1
for the spectrograms of four example sounds used in the
study). The mapping is performed via a convolution of the
neural responses evoked by the sound, $r(t, x)$, with unknown
spatio-temporal response functions $g(\tau, f, x)$:

$$s(t, f) = \sum_{x} \sum_{\tau} g(\tau, f, x)\, r(t - \tau, x) + \epsilon, \tag{1}$$

where $x$ indexes the MEG vertices (here, sensors), $f$
represents the frequency channels, $\tau$ indicates the fixed
duration (also referred to as the temporal lag above), and
$\epsilon$ is an additive zero-mean Gaussian random variable.
In this model, the reconstruction for each frequency channel
$s_f$ is treated independently of the other channels. If we
consider the reconstruction of one channel, it can be written as

$$s_f(t) = \sum_{x} \sum_{\tau} g_f(\tau, x)\, r(t - \tau, x) + \epsilon. \tag{2}$$

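To simplify the description of the inference algorithm used
in this study, we transform the model into a linear algebraic
form. First we define the response matrix
$R \in \mathbb{R}^{(NT) \times (\tau x)}$
such that each row $(n, t)$ contains the MEG response profile
to sound $n$ across the entire set of sensors $x$ at time $t$
and the subsequent time bins up to lag $\tau$:

$$R = \begin{bmatrix}
r_1(1,1) & \cdots & r_1(1,x) & r_1(2,1) & \cdots & r_1(\tau,x) \\
r_1(2,1) & \cdots & r_1(2,x) & r_1(3,1) & \cdots & r_1(\tau+1,x) \\
\vdots & & & & & \vdots \\
r_N(T,1) & \cdots & r_N(T,x) & r_N(T+1,1) & \cdots & r_N(T+\tau-1,x)
\end{bmatrix}$$

and

$$S_f = [\, s_f(1,1) \;\; s_f(1,2) \;\; \cdots \;\; s_f(1,T) \;\; \cdots \;\; s_f(N,T) \,]^{\top}.$$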

Using the matrix notation, Eq. 2 becomes $S_f = R G_f + \epsilon$,
which is similar to multiple linear regression with weights
$G_f$. Given a pre-defined lag, the function $G_f$ is estimated by
minimizing the mean-squared error between the actual and
the predicted stimuli:
$\arg\min_{G_f} \sum_{n,t} \left(\hat{s}_f(n,t) - s_f(n,t)\right)^2$.

Solving this results in a maximum likelihood (ML) estimate:

$$G_f = (R^{\top} R)^{-1} R^{\top} S_f. \tag{3}$$

The estimate requires inversion of the inner product $R^{\top} R$,
which has dimension $d \times d$, where $d = \tau x$ is the
dimension of the MEG response data. In neuroimaging,
particularly in MEG, the value of $d$ is typically large. This is
primarily due to the high spatial resolution of MEG, where the
data are sampled from hundreds to thousands of vertices, $x$,
depending on whether the data are represented at the sensor or
source level. Further, the different sources can be highly
correlated in MEG, which makes the inversion ill-conditioned,
i.e., the resulting inverse may not be possible to compute or it
may be very sensitive to slight variations in the data.
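The construction of the lagged response matrix and the estimate in Eq. 3 can be illustrated with a minimal NumPy sketch. It is not the implementation used in the study: the function name `lagged_matrix`, the toy array sizes, and the choice to keep only time bins with a full lag window are assumptions made for illustration.

```python
import numpy as np


def lagged_matrix(responses, n_lags):
    """Build R: one row per (sound, time bin), holding that bin and the next n_lags - 1 bins.

    responses has shape (n_sounds, n_times, n_sensors); only time bins with a
    full window of n_lags bins are kept, so T = n_times - n_lags + 1.
    """
    n_sounds, n_times, n_sensors = responses.shape
    T = n_times - n_lags + 1
    rows = [
        responses[n, t:t + n_lags, :].ravel()  # sensors at t, then at t + 1, ...
        for n in range(n_sounds)
        for t in range(T)
    ]
    return np.asarray(rows)


rng = np.random.default_rng(0)
resp = rng.standard_normal((4, 30, 6))      # toy data: 4 sounds, 30 bins, 6 sensors
R = lagged_matrix(resp, n_lags=5)           # shape (4 * 26, 5 * 6) = (104, 30)

g_true = rng.standard_normal(R.shape[1])
S_f = R @ g_true + 0.01 * rng.standard_normal(R.shape[0])

# Eq. 3 as a least-squares solve; lstsq avoids forming (R^T R)^{-1} explicitly.
G_f, *_ = np.linalg.lstsq(R, S_f, rcond=None)

# With fewer rows than columns, R^T R is rank-deficient and Eq. 3 has no unique solution.
R_few = R[:10]
rank_few = np.linalg.matrix_rank(R_few.T @ R_few)
```

In the toy case the rows outnumber the columns, so the estimate is well defined. The last two lines mimic the MEG setting, where $d$ exceeds the number of rows and the inner product cannot be inverted.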
Figure 2. Event-related responses in one subject (averaged over 20
repetitions of the sound) at an MEG sensor located over the left hemisphere.

B. Kernel convolution model 

By applying developments similar to those for linear regression
[?], we reformulate the classical convolution model in terms
of its kernel or dual representation and add suitable
regularization. In this representation, we use a sparse prior on
the response function $G_f$: $G_f \sim \mathcal{N}(0, \lambda_f^{-1} I)$,
where $\lambda_f > 0$ is the regularization parameter and $I$ is
an identity matrix. The function can be determined by maximizing
the log-posterior distribution of $G_f$, which is equivalent to
minimizing the regularized sum-of-squares error function given by

$$\arg\min_{G_f} \sum_{n,t} \left(\hat{s}_f(n,t) - s_f(n,t)\right)^2 + \lambda_f \sum_{x,\tau} g_f(\tau, x)^2. \tag{4}$$

Solving this yields a maximum a posteriori (MAP) estimate:

$$G_f = (R^{\top} R + \lambda_f I)^{-1} R^{\top} S_f. \tag{5}$$

The addition of the regularization term stabilizes the
estimation of the inverse. Following the derivation of kernel
ridge regression, the MAP estimate can be obtained using the
dual form of the sparse convolution model:

$$G_f = R^{\top} (R R^{\top} + \lambda_f I)^{-1} S_f. \tag{6}$$

Unlike the original form (Eq. 5, or the non-sparse version,
Eq. 3), which requires the inversion of
$R^{\top} R \in \mathbb{R}^{(\tau x) \times (\tau x)}$,
the dual form requires inversion of the Gram matrix
$K = R R^{\top} \in \mathbb{R}^{(NT) \times (NT)}$. This is very
useful for neuroimaging studies where the number of conditions,
$N$, is typically very low compared to the number of neural
sources $x$, while $\tau$ and $T$ are of the same order. To
estimate $\lambda_f$, we follow [?] and use an efficient
computational technique [?], which avoids the inversion
$(R R^{\top} + \lambda_f I)^{-1}$ for each value of $\lambda_f$
and uses a fast scoring measure to estimate the leave-one-out
error for different values of the regularization parameter.
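The equivalence of the primal and dual estimates (Eqs. 5 and 6) can be checked numerically, and one common way to score many regularization values without repeated inversions can be sketched alongside. The sketch below makes several assumptions: the data are random, the function name `loo_errors` is hypothetical, and the closed-form leave-one-out residual for kernel ridge regression (dual coefficient divided by the matching diagonal entry of the inverse) is a standard identity that may differ from the scoring measure used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, d = 40, 300                      # few rows (N * T), many columns (tau * x)
R = rng.standard_normal((n_rows, d))
S_f = rng.standard_normal(n_rows)
lam = 2.0

# Primal MAP estimate (Eq. 5): solves a d x d system.
G_primal = np.linalg.solve(R.T @ R + lam * np.eye(d), R.T @ S_f)

# Dual estimate (Eq. 6): solves an n_rows x n_rows system.
K = R @ R.T
G_dual = R.T @ np.linalg.solve(K + lam * np.eye(n_rows), S_f)


def loo_errors(K, y, lams):
    """Mean squared leave-one-out error of kernel ridge for several lambdas.

    One eigendecomposition of K is reused for every lambda, so no matrix is
    inverted inside the loop.
    """
    w, V = np.linalg.eigh(K)
    Vy = V.T @ y
    errors = []
    for lam_k in lams:
        scale = 1.0 / (w + lam_k)
        inv_diag = np.einsum("ij,j,ij->i", V, scale, V)  # diag of (K + lam I)^{-1}
        alpha = V @ (scale * Vy)                           # (K + lam I)^{-1} y
        errors.append(np.mean((alpha / inv_diag) ** 2))    # LOO residual = alpha_i / inv_ii
    return np.array(errors)


scores = loo_errors(K, S_f, [0.5, 2.0, 8.0])
best_lam = [0.5, 2.0, 8.0][int(np.argmin(scores))]
```

Here leave-one-out is taken over rows of $R$ (time bins); leaving out whole sounds instead would need a grouped variant of the same idea.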
The entire reconstruction of the sound spectrogram can
be described as the collection of convolution functions for
each frequency channel: $G = [G_1\; G_2\; \cdots\; G_F]$, with
$F$ frequency channels. Then, given an MEG response to a test
sound, we take the lagged representation of the response,
$r_{\text{new}} \in \mathbb{R}^{T \times (\tau x)}$, and obtain
a prediction of its spectrogram
$S_{\text{new}} \in \mathbb{R}^{T \times F}$ as follows:

$$S_{\text{new}} = r_{\text{new}} G = r_{\text{new}} R^{\top} (R R^{\top} + \lambda_f I)^{-1} S. \tag{7}$$

The dual formulation can be obtained by noticing that the
prediction in Eq. 7 operates on the feature space and only
involves inner products. These inner products can be replaced
with a kernel function
$k(r_n, r_m) = \phi(r_n)^{\top} \phi(r_m) = \sum_i \phi_i(r_n)\, \phi_i(r_m)$,
where $\phi_i(r)$ are the basis functions. If
we substitute the kernel functions for the inner products,
we obtain the following prediction of the spectrogram:
$S_{\text{new}} = k(r_{\text{new}}) (K + \lambda_f I)^{-1} S$,
where we have defined the matrix $k(r_{\text{new}})$ with
column-wise concatenation of submatrices
$k(r_{\text{new}}, r_n)$. Similarly, the submatrices of $K$ are
defined using the kernel function $k(r_n, r_m)$. Thus, the dual
formulation implicitly allows the use of feature spaces of very
high, even infinite, dimensions.
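The kernel form of the prediction can be sketched in a few lines. With a linear kernel it reproduces Eq. 7 exactly, and swapping in another kernel changes only the two kernel evaluations. This is an illustrative sketch: the function names, the toy dimensions, the single shared regularization value, and the Gaussian (RBF) kernel with its `gamma` parameter are assumptions and are not taken from the study.

```python
import numpy as np


def linear_kernel(A, B):
    """Inner products between the rows of A and the rows of B."""
    return A @ B.T


def rbf_kernel(A, B, gamma=0.01):
    """Gaussian kernel; a hypothetical non-linear choice for illustration."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_predict(R_train, S_train, r_new, lam, kernel):
    """S_new = k(r_new) (K + lam I)^{-1} S for all frequency channels at once."""
    K = kernel(R_train, R_train)
    k_new = kernel(r_new, R_train)
    return k_new @ np.linalg.solve(K + lam * np.eye(K.shape[0]), S_train)


rng = np.random.default_rng(2)
R_train = rng.standard_normal((50, 200))   # 50 training rows, 200 lagged features
S_train = rng.standard_normal((50, 8))     # 8 frequency channels
r_new = rng.standard_normal((10, 200))     # lagged test response, 10 time bins
lam = 1.0

S_lin = kernel_predict(R_train, S_train, r_new, lam, linear_kernel)
S_rbf = kernel_predict(R_train, S_train, r_new, lam, rbf_kernel)

# Explicit weights of Eq. 6, stacked over channels, for comparison with S_lin.
G = R_train.T @ np.linalg.solve(R_train @ R_train.T + lam * np.eye(50), S_train)
```

The comparison with `r_new @ G` shows that the linear-kernel prediction never needs the weight matrix itself, which is what keeps the cost independent of the number of sensors.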

C. Model evaluation 

We performed a leave-two-out cross-validation where, in
each fold, we used all but two randomly picked sounds as
training data. To label the held-out sounds without using any
training examples for those sounds, we followed a two-stage
prediction procedure, similar to [10]. In the first stage, we
applied the learned functions to predict the spectrograms for
the test sound pair and concatenated the spectrograms along the
temporal dimension to form vectors for both the predicted and
the original spectrograms. In the second stage, we quantified
the predictive accuracy by computing the correlation between
the reconstructed and the original spectrograms of the two test
sounds. If the two predictions are denoted $p_1$ and $p_2$ and
the original spectrograms $s_1$ and $s_2$, then the labelling
assigned by the model was considered correct if

$$\mathrm{corr}(s_1, p_1) + \mathrm{corr}(s_2, p_2) > \mathrm{corr}(s_1, p_2) + \mathrm{corr}(s_2, p_1). \tag{8}$$

This process was repeated for all possible combinations of
leave-two-out sounds. Under this evaluation, the expected
performance of a random model is 50%. Since the sounds
are of different durations, to evaluate Eq. 8 we truncated
the predicted and original spectrograms to the length of the
shorter sound in each test sound pair.
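The pairwise decision rule of Eq. 8, including the truncation to the shorter sound, can be written compactly. The sketch below assumes spectrograms stored as arrays of shape (time bins, frequency channels) and Pearson correlation on the flattened arrays; the function name `pair_correct` and the toy data are hypothetical.

```python
import numpy as np


def pair_correct(s1, s2, p1, p2):
    """Return True when Eq. 8 assigns the two predictions to the right sounds.

    s1, s2 are original spectrograms and p1, p2 the predictions, each of shape
    (n_times, n_freqs). All four are truncated to the shorter sound.
    """
    n = min(len(s1), len(s2))
    s1, s2, p1, p2 = (a[:n].ravel() for a in (s1, s2, p1, p2))

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    return corr(s1, p1) + corr(s2, p2) > corr(s1, p2) + corr(s2, p1)


rng = np.random.default_rng(3)
s1 = rng.standard_normal((120, 16))            # longer sound
s2 = rng.standard_normal((90, 16))             # shorter sound
p1 = s1 + 0.5 * rng.standard_normal(s1.shape)  # noisy prediction of sound 1
p2 = s2 + 0.5 * rng.standard_normal(s2.shape)  # noisy prediction of sound 2

matched = pair_correct(s1, s2, p1, p2)
swapped = pair_correct(s1, s2, p2, p1)
```

Averaging the boolean outcome over all test pairs gives the accuracy reported against the 50% chance level.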

To evaluate how well the spectrogram features were
predicted, we use the following score:

$$\mathrm{score}_{f,t} = 1 - \frac{\sum \left(s_{f,t} - \hat{s}_{f,t}\right)^2}{\sum \left(s_{f,t} - \bar{s}_{f,t}\right)^2}, \tag{9}$$

where $s_{f,t}$ is the original value of the spectrogram at
frequency $f$ and time $t$, $\hat{s}_{f,t}$ is the value predicted
by the model, and $\bar{s}_{f,t}$ is the mean value across all
pairs in the cross-validation combinations. The summations in
Eq. 9 are computed over all pairs of cross-validation samples.
The feature score thus measures the proportion of variation
explained in each feature.

Figure 3. Performance of the kernel convolution model as a function of
time-lag. Each point is an average over 946 cross-validation tests. Chance
accuracy is 50%. Solid line is the average accuracy across the three subjects.
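The feature score of Eq. 9 is a coefficient of determination computed separately for each feature. A minimal sketch follows, assuming the held-out values from all cross-validation folds have been pooled into arrays with one row per sample and one column per feature; the function name `feature_score` and the toy data are hypothetical.

```python
import numpy as np


def feature_score(s_true, s_pred):
    """Eq. 9 for each column: 1 - (residual sum of squares / total sum of squares).

    s_true and s_pred have shape (n_samples, n_features), with samples pooled
    over the cross-validation test folds.
    """
    residual = ((s_true - s_pred) ** 2).sum(axis=0)
    total = ((s_true - s_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - residual / total


rng = np.random.default_rng(4)
s_true = rng.standard_normal((200, 5))
perfect = feature_score(s_true, s_true)                       # every score is 1
baseline = feature_score(s_true,
                         np.broadcast_to(s_true.mean(axis=0), s_true.shape))  # every score is 0
noisy = feature_score(s_true, s_true + rng.standard_normal(s_true.shape))
```

A score of 1 means the feature is reproduced exactly, 0 means the prediction is no better than the mean, and negative values are possible for predictions worse than the mean.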

III. Experiments 

A. MEG recordings 

The data consisted of event-related MEG responses from 
three subjects listening to common environmental sounds 
(44 items). The sounds included sets of 6 to 8 items from five
pre-selected categories (vehicles, music, human, animal and 
tool) and 8 uncategorized sounds. Each sound was presented 
20 times. 

Magnetic fields associated with neural current flow were 
recorded with a 306-channel whole-head neuromagnetometer
(Elekta Oy, Helsinki) in the Aalto Neuroimaging MEG
Core. The MEG signals were band-pass filtered between 
0.03 and 330 Hz and sampled at 1000 Hz. During the 
recordings subjects listened to a pseudo-randomly shuffled 
sequence of sounds and were asked to respond by finger 
lift when two consecutive sounds referred to the same item. 
Response trials were excluded from analysis. The event- 
related responses to the 20 repetitions of each stimulus were 
averaged from 300 ms before to 2000 ms after the stimulus 
onset, rejecting trials contaminated by eye movements. On 
average 19.2 ± 1.1 (mean ± standard deviation) artifact- 
free epochs (repetitions) per subject were gathered for each 
item. The averaged MEG responses were baseline-corrected 
to the 200 ms interval immediately preceding the stimulus 
onset and down-sampled to 10 ms intervals. Data analysis 
was restricted to 56 planar gradiometers above the auditory 
cortex. Example responses are depicted in Fig. 2.

B. Stimulus spectrogram representation 

The auditory spectrogram representation was binned at
10 ms and calculated based on an auditory filter bank
with 128 overlapping bandpass filter channels mimicking the
auditory periphery [11]. Filters had logarithmically spaced
central frequencies ranging from 180 to 7246 Hz (Fig. 1).

C. Prediction of sound spectrograms from MEG responses

Prior to the analysis, both spectrogram and MEG data
were standardized to zero mean and unit variance. We used
causal response functions ($\tau \le 0$; [?]), which means that
the model decoded spectrograms of sounds at time $t$ using
neural responses at times $t, t+1, t+2, \ldots, t-\tau$ ms. To
evaluate the sensitivity of MEG neural responses to the
frequencies in the stimulus spectrogram, we evaluated the mean
predictive accuracy across all possible leave-two-out
combinations of 44 sounds ($\binom{44}{2} = 946$ combinations)
for different time-lags $-\tau$ = 20, 100, 180, 260, 340, 420,
500, 580, 740 and 980 ms. The results, shown in Fig. 3, indicate
that it was possible to discriminate between two previously
unencountered test sounds with approximately 70% accuracy (mean
value 70.0% to 71.9% at time-lags of 250 to 500 ms) even when
neither sound was used in the training data. Next, to evaluate
which spectrogram features were best predicted, we considered
the time-lag of 500 ms that gave the optimal predictions and
computed the mean score (Eq. 9) across the three subjects for
each spectrogram feature. The top 15 scoring features represent
high stimulus frequencies (above 3.8 kHz) with scores ranging
from 0.12 to 0.21. Further, we computed the item-wise mean
predictive accuracy over the cross-validation folds. The
five best predicted sounds were camera (95.3%), helicopter
(88.4%), lighting a match (84.5%), motorsaw (83.7%), and
door (82.9%), while the five least accurately predicted sounds
were trumpet (59.7%), laughter (58.9%), yawning (56.6%),
zipper (55.0%), and thunder (54.3%). The best predicted
sounds typically contained higher frequencies compared to
the less accurately predicted sounds (see Fig. 1).
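The bookkeeping behind these numbers can be reproduced with the standard library: enumerating the leave-two-out folds, converting the time-lags to 10 ms bins, and averaging fold outcomes per sound. The sketch below is illustrative only; the function name `itemwise_accuracy` and the way fold outcomes are stored are assumptions.

```python
from itertools import combinations
from math import comb

n_sounds = 44
folds = list(combinations(range(n_sounds), 2))   # every leave-two-out test pair

lags_ms = [20, 100, 180, 260, 340, 420, 500, 580, 740, 980]
bin_ms = 10                                       # MEG and spectrogram bin width
lag_bins = [lag // bin_ms for lag in lags_ms]     # lag expressed in 10 ms bins


def itemwise_accuracy(correct, folds, n_items):
    """Mean accuracy per sound over the folds in which it was held out.

    correct[i] is True when fold i was labelled correctly (Eq. 8).
    """
    hits = [0] * n_items
    counts = [0] * n_items
    for ok, (a, b) in zip(correct, folds):
        for item in (a, b):
            hits[item] += int(ok)
            counts[item] += 1
    return [h / c for h, c in zip(hits, counts)]


# Toy outcome: only the folds that contain sound 0 are labelled correctly.
toy_correct = [a == 0 for a, b in folds]
toy_accuracy = itemwise_accuracy(toy_correct, folds, n_sounds)
```

Each sound is held out in 43 of the 946 folds, so an item-wise accuracy for one subject is a multiple of 1/43.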

IV. Discussion and Conclusion 

Our results demonstrate that the kernel convolution model
provides an efficient method for predicting spectrograms
of new sounds. Predictions are made by decoding neural
information in high-dimensional MEG responses to common
environmental sounds. Therefore, the extracted neural
information can be regarded as being based on neural
mechanisms that generalize across a variety of sounds. We
evaluated different time-lags in the MEG response data to
predict spectrograms of unencountered sounds, and observed
that the responses are most sensitive for a duration of
around 250 to 500 ms from the input stimulus. The auditory
evoked responses used in the analysis are most prominent at
50 to 500 ms after the stimulus onset regardless of the
stimulus duration (see Fig. 2). Thus, at the longest time-lags
(> 500 ms) the MEG data are noisier than at shorter lags, as the
decaying MEG responses start to show large inter-response
variability. Neurophysiological interpretation and evaluation
of the significance of the results are natural extensions of the
study. The decoding problem studied here is an example of
an underdetermined system, for which regularization and
Bayesian inference have provided reasonable answers.

Classical linear regression has been used earlier to decode
neural responses, but most studies have been limited either
to non-time-varying stimulus representations [?] or to
neuroimaging recordings [10], [13]. The proposed method
will be useful for analyzing the brain's ability to understand
sounds in an acoustic environment, particularly when neural
responses are recorded at high spatio-temporal resolution.